# COMPSCI 389: Introduction to Machine Learning
# Topic 2.1: Pandas and Data Sets

This notebook provides a description of how data sets are represented and manipulated using the `pandas` library.

## What is pandas?

The name pandas is derived from "PANel DAta," an econometrics term for data sets that include observations of the same subjects over multiple time periods. Documentation: [link](https://pandas.pydata.org/docs/index.html).

It provides two main objects: a **DataFrame** and a **Series**.

A DataFrame object stores a 2-dimensional table of data, while a Series stores a 1-dimensional vector of data.
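
As a minimal sketch of the difference between the two objects, the block below builds one of each from hypothetical toy values (the variable names `scores` and `gpa` are illustrative and are not used elsewhere in this notebook).

```python
import pandas as pd

# Hypothetical toy data: two students, two exam scores each.
scores = pd.DataFrame({"physics": [620.0, 540.0], "math": [730.0, 380.0]})
gpa = pd.Series([1.3, 3.0], name="gpa")

print(scores.shape)  # (rows, columns) of the 2-dimensional table
print(gpa.shape)     # a 1-tuple, since a Series has a single dimension

# Selecting a single column of a DataFrame yields a Series.
physics = scores["physics"]
print(type(physics))
```

Selecting one column by name is a common way to move from a DataFrame to a Series.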

Pandas provides useful functions for working with these objects, including functions for:
1. Loading data sets from files and storing them in DataFrame and/or Series objects.
2. Manipulating DataFrame and Series objects (e.g., adding or removing features).
3. Computing statistics of the data (e.g., the minimum and maximum values of features).
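
The three kinds of functions listed above can be sketched on a tiny example. The CSV text and the derived column name `physics_math_avg` are hypothetical, chosen only so that the block runs without any file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = "physics,math,gpa\n600.0,700.0,2.5\n500.0,400.0,3.5\n550.0,650.0,3.0\n"

# 1. Load: read_csv accepts a path, a URL, or a file-like object.
toy = pd.read_csv(io.StringIO(csv_text))

# 2. Manipulate: add a feature, then remove one.
toy["physics_math_avg"] = (toy["physics"] + toy["math"]) / 2
toy = toy.drop(columns=["physics"])

# 3. Statistics: minimum and maximum of each remaining column.
col_min = toy.min()
col_max = toy.max()
print(col_min)
print(col_max)
```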

Pandas has become so common that many other ML libraries in Python are built to be compatible with pandas, as we will see below.

To install pandas, run the following command in the console or command line:

> pip install pandas

## Example Data Sets

In the remainder of this notebook we load and inspect a few example data sets for supervised learning.

## GPA Data

The GPA data set contains data about undergraduate students at the *Universidade Federal do Rio Grande do Sul* (UFRGS) in Brazil.

**Input**: Scores on 9 entrance exams: 
1. Physics
2. Biology
3. History
4. English
5. Geography
6. Literature
7. Portuguese
8. Math
9. Chemistry

**Output**: GPA on a 4.0 scale during the first three semesters at university.
 - The GPA can be used for regression (predict the GPA) or classification (predict the GPA range, e.g., whether it is at least 3.0).
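
The two uses of the GPA can be sketched as follows, assuming a handful of hypothetical GPA values. The 3.0 threshold is the one mentioned above, and the names `y_regression` and `y_classification` are illustrative.

```python
import pandas as pd

# Hypothetical GPA values on a 4.0 scale.
gpa = pd.Series([1.33, 2.98, 3.0, 3.82], name="gpa")

# Regression target: the GPA itself.
y_regression = gpa

# Classification target: 1 if the GPA is at least 3.0, otherwise 0.
y_classification = (gpa >= 3.0).astype(int)
print(y_classification.tolist())
```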

**Data set Size**: 43,303

Let's start by loading and displaying this data set. The data set is available here:

[https://people.cs.umass.edu/~pthomas/courses/COMPSCI_389_Spring2024/GPA.csv](https://people.cs.umass.edu/~pthomas/courses/COMPSCI_389_Spring2024/GPA.csv)

You can either download the file into a directory called `data` next to this .ipynb file and load that local copy, or load it directly from the online posting:

In [1]:
import pandas as pd                             # Import pandas

# Load the data set directly from the online link. Values are separated by commas,
# which is the default for read_csv, so delimiter=',' is stated only for clarity.
df = pd.read_csv("https://people.cs.umass.edu/~pthomas/courses/COMPSCI_389_Spring2024/GPA.csv", delimiter=',')

# Load the data set from a local `data` directory, assuming numbers are separated by commas
# df = pd.read_csv("data/GPA.csv", delimiter=',')

# print(df)                                     # Prints a string representation of the DataFrame
display(df)                                     # Renders an HTML table (for Jupyter Notebooks - don't use in .py file)

| Unnamed: 0 | physics | biology | history | English | geography | literature | Portuguese | math | chemistry | gpa |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 622.60 | 491.56 | 439.93 | 707.64 | 663.65 | 557.09 | 711.37 | 731.31 | 509.80 | 1.33333 |
| 1 | 538.00 | 490.58 | 406.59 | 529.05 | 532.28 | 447.23 | 527.58 | 379.14 | 488.64 | 2.98333 |
| 2 | 455.18 | 440.00 | 570.86 | 417.54 | 453.53 | 425.87 | 475.63 | 476.11 | 407.15 | 1.97333 |
| 3 | 756.91 | 679.62 | 531.28 | 583.63 | 534.42 | 521.40 | 592.41 | 783.76 | 588.26 | 2.53333 |
| 4 | 584.54 | 649.84 | 637.43 | 609.06 | 670.46 | 515.38 | 572.52 | 581.25 | 529.04 | 1.58667 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 43298 | 519.55 | 622.20 | 660.90 | 543.48 | 643.05 | 579.90 | 584.80 | 581.25 | 573.92 | 2.76333 |
| 43299 | 816.39 | 851.95 | 732.39 | 621.63 | 810.68 | 666.79 | 705.22 | 781.01 | 831.76 | 3.81667 |
| 43300 | 798.75 | 817.58 | 731.98 | 648.42 | 751.30 | 648.67 | 662.05 | 773.15 | 835.25 | 3.75000 |
| 43301 | 527.66 | 443.82 | 545.88 | 624.18 | 420.25 | 676.80 | 583.41 | 395.46 | 509.80 | 2.50000 |
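
The choice between the local copy and the online posting could also be made automatically. The following is only a sketch: `gpa_source` is a hypothetical helper (not part of pandas), and it assumes the local copy, if present, lives at `data/GPA.csv`.

```python
from pathlib import Path

GPA_URL = "https://people.cs.umass.edu/~pthomas/courses/COMPSCI_389_Spring2024/GPA.csv"

def gpa_source(local_path="data/GPA.csv", url=GPA_URL):
    """Hypothetical helper: return the local path if that file exists, else the URL."""
    return local_path if Path(local_path).is_file() else url

source = gpa_source()
print(source)
# In the notebook, the result can be passed straight to pandas:
# df = pd.read_csv(source)
```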


**Question**: Is each column numerical, categorical, text, or an image? Continuous, discrete, nominal, or ordinal?

**Answer**: All of these columns are numerical and continuous.

**Question**: If the GPAs were binned into letter grades A, B, C, ..., F, would they be numerical, categorical, text, or an image? Continuous, discrete, nominal, or ordinal?

**Answer**: In this case the GPAs would be categorical, and specifically ordinal.
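
One way to perform such a binning in pandas is `pd.cut`, which can produce an ordered categorical. This is a sketch only: the GPA values and the letter-grade cutoffs below are hypothetical and are not taken from UFRGS or from this course.

```python
import pandas as pd

# Hypothetical GPA values on a 4.0 scale.
gpa = pd.Series([1.33, 2.98, 3.75, 0.5, 2.0], name="gpa")

# Hypothetical cutoffs: [0, 1] -> F, (1, 1.7] -> D, (1.7, 2.7] -> C,
# (2.7, 3.7] -> B, (3.7, 4] -> A.
bins = [0.0, 1.0, 1.7, 2.7, 3.7, 4.0]
labels = ["F", "D", "C", "B", "A"]

# With labels given, pd.cut returns an ordered categorical by default.
letters = pd.cut(gpa, bins=bins, labels=labels, include_lowest=True)
print(letters.tolist())
print(letters.cat.ordered)

# Because the categories are ordered, comparisons are meaningful.
at_least_c = letters >= "C"
print(at_least_c.tolist())
```

The ordering is what makes the binned grades ordinal rather than nominal: "B" is above "C", but the distance between the two is not defined.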

Notice that pandas views this as a table with rows and columns. Hence features *and* labels are viewed as "columns" when using pandas.
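
Because pandas treats every column the same way, deciding which column is the label is up to the user. A minimal sketch, using a two-row, three-column excerpt of the values shown above (the names `toy`, `features`, and `labels` are illustrative):

```python
import pandas as pd

# A small excerpt of the GPA data shown above: two rows, three columns.
toy = pd.DataFrame({
    "physics": [622.60, 538.00],
    "math": [731.31, 379.14],
    "gpa": [1.33333, 2.98333],
})

print(toy.columns.tolist())  # features and label are all just columns
print(toy.shape)             # (number of rows, number of columns)

# Splitting by column name: the user chooses "gpa" as the label.
features = toy.drop(columns=["gpa"])
labels = toy["gpa"]
```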

## Manipulating DataFrames

In this section we give some examples of how DataFrames can be used to compute statistics of data and how DataFrames can be manipulated.

First, let's use the pandas `iloc` indexer (integer-location based indexing, which selects by position) to split the data set into the input features $X$ and the targets/labels $y$.

In [2]:
X = df.iloc[:, :-1] # All columns except the last as features. This creates a new DataFrame X.
print(type(X))      # Confirm that this is actually a new DataFrame by printing the type of X.
y = df.iloc[:, -1]  # The last column contains the labels. This creates a new Series (like a 1-dimensional DataFrame) y
display(X)          # Display the input columns
display(y)          # Display the output (label) column

<class 'pandas.core.frame.DataFrame'>


| Unnamed: 0 | physics | biology | history | English | geography | literature | Portuguese | math | chemistry |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 622.60 | 491.56 | 439.93 | 707.64 | 663.65 | 557.09 | 711.37 | 731.31 | 509.80 |
| 1 | 538.00 | 490.58 | 406.59 | 529.05 | 532.28 | 447.23 | 527.58 | 379.14 | 488.64 |
| 2 | 455.18 | 440.00 | 570.86 | 417.54 | 453.53 | 425.87 | 475.63 | 476.11 | 407.15 |
| 3 | 756.91 | 679.62 | 531.28 | 583.63 | 534.42 | 521.40 | 592.41 | 783.76 | 588.26 |
| 4 | 584.54 | 649.84 | 637.43 | 609.06 | 670.46 | 515.38 | 572.52 | 581.25 | 529.04 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 43298 | 519.55 | 622.20 | 660.90 | 543.48 | 643.05 | 579.90 | 584.80 | 581.25 | 573.92 |
| 43299 | 816.39 | 851.95 | 732.39 | 621.63 | 810.68 | 666.79 | 705.22 | 781.01 | 831.76 |
| 43300 | 798.75 | 817.58 | 731.98 | 648.42 | 751.30 | 648.67 | 662.05 | 773.15 | 835.25 |
| 43301 | 527.66 | 443.82 | 545.88 | 624.18 | 420.25 | 676.80 | 583.41 | 395.46 | 509.80 |


0        1.33333
1        2.98333
2        1.97333
3        2.53333
4        1.58667
          ...   
43298    2.76333
43299    3.81667
43300    3.75000
43301    2.50000
43302    3.16667
Name: gpa, Length: 43303, dtype: float64

Notice that the variable `y` displays differently from `X`. This is because `y` is a Series (1-dimensional vector), while `X` is a DataFrame (2-dimensional matrix/table).
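
Whether `iloc` returns a Series or a DataFrame depends on how the column is specified. The sketch below uses a small stand-in DataFrame (the names `toy`, `y_series`, and `y_frame` are illustrative) to show both cases.

```python
import pandas as pd

# Small stand-in for the GPA DataFrame.
toy = pd.DataFrame({
    "physics": [622.60, 538.00],
    "math": [731.31, 379.14],
    "gpa": [1.33333, 2.98333],
})

y_series = toy.iloc[:, -1]    # a single integer position gives a Series
y_frame = toy.iloc[:, [-1]]   # a list of positions gives a one-column DataFrame

print(type(y_series).__name__, y_series.shape)
print(type(y_frame).__name__, y_frame.shape)
```

Passing a list keeps the result 2-dimensional, which can matter for code that expects a table even when it has only one column.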

Also, in the output of the above block, `float64` means that each element in the `y` Series is a floating point number represented with 64 bits.
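
The element type of a Series can be inspected through its `dtype` attribute and changed with `astype`. A minimal sketch with three of the GPA values shown above (the names `y_toy` and `y_small` are illustrative):

```python
import pandas as pd

y_toy = pd.Series([1.33333, 2.98333, 1.97333], name="gpa")
print(y_toy.dtype)  # Python floats are stored as 64-bit floats by default

# Converting to 32-bit floats halves the memory used by the values,
# at the cost of reduced precision.
y_small = y_toy.astype("float32")
print(y_toy.nbytes, y_small.nbytes)
```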